Building an Arabic news transcription system with web-crawled resources

Authors

  • Arianna Bisazza
  • Roberto Gretter
Abstract

This paper describes our efforts to build an Arabic ASR system with web-crawled resources. We first describe the processing done to handle Arabic text in general and, more particularly, to cope with the high number of different phonetic transcriptions associated with a typical Arabic word. Then, we present our experiments to build acoustic models using only audio data found on the web, in particular on the Euronews portal. To transcribe the downloaded audio we compare two approaches: the first uses a baseline trained on manually transcribed Arabic corpora, while the second uses a universal ASR system trained on automatically transcribed speech data of 8 languages (not including Arabic). We demonstrate that with the latter approach we are able to obtain recognition performance comparable to that obtained with a fully supervised Arabic baseline.

Keywords: Arabic, ASR, morphological segmentation, lightly supervised training, under-resourced languages

Similar resources

Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents

Beyond its own merits, text classification (TC) has become a cornerstone in many applications. The work presented here is part of, and a prerequisite for, a project we have undertaken to create a corpus for Arabic text processing. It is an attempt to create modules automatically that would help speed up the process of classification for any text categorization task. It also serves as a tool for...

Digital Library and Archiving for Qatar

Crawling and Indexing Qatari Scholarly Content. SeerQ. SeerSuite is a collection management system for digital libraries, developed at Penn State. It includes: 1) a Web crawler for scholarly articles; 2) a machine-learning-based automated system for metadata (title, abstract, author name/affiliation, citations) extraction; 3) a module for ingesting extracted information into a database and Solr; a...

DC Proposal: Model for News Filtering with Named Entities

In this paper we introduce the project of our PhD thesis. The subject is a model for filtering news articles. We propose a framework that combines information about named entities extracted from news articles with the article texts. Named entities are enriched with additional attributes crawled from semantic web resources. These properties are then used to enhance the filtering results. We described va...

Towards Supporting Exploratory Search over the Arabic Web Content: The Case of ArabXplore

Due to the huge amount of data published on the Web, the Web search process has become more difficult, and it is sometimes hard to get the expected results, especially when the users are less certain about their information needs. Several efforts have been proposed to support exploratory search on the web by using query expansion, faceted search, or supplementary information extracted from exte...

Collection and Evaluation of Broadcast News Data for Arabic

This paper presents a general methodology for acquiring and automatically segmenting broadcast news data from the web. It is shown that, starting from a relatively small corpus of about 10 hours, it is possible to automatically segment about 30 hours of data. This step is important because manual segmentation of broadcast news data is generally very tedious and time-consuming. In ad...

Publication date: 2013